Abstract: Protein interactions play an important role in the discovery of protein functions and pathways in
biological processes. This is especially true for diseases caused by the loss of specific protein-protein
interactions in the organism. The accuracy of experimental methods for finding protein-protein interactions,
however, is questionable: high-throughput experiments have shown high rates of both false positives and false
negatives. Computational methods have attracted considerable attention among biologists because they can
predict protein-protein interactions and validate experimental results. In this study, we review several
computational methods for protein-protein interaction prediction, describe the major databases that store both
predicted and detected protein-protein interactions, and survey the tools used for analyzing protein
interaction networks and improving protein-protein interaction reliability.
1. INTRODUCTION

Protein-protein interactions (PPIs) are of interest in biology because they regulate nearly all
cellular processes, including metabolic cycles, DNA transcription and replication, different
signaling cascades and many additional processes. Proteins carry out their cellular functions
through concerted interactions with other proteins, so it is important to know the specific nature
of these relationships. Indeed, the importance of understanding these interactions has prompted the
development of various experimental methods for measuring them. While the amount of genomic sequence
information continues to increase exponentially, the annotation of protein sequences is lagging
behind, both in quality and in quantity. Multi-pronged, high-throughput functional genomics
approaches are needed to bridge the gap between raw sequence information and the relevant
biochemical and medical information. Computational methods are therefore required for discovering
interactions that are not accessible to high-throughput methods. These computational predictions can
then be verified using more labor-intensive methods. A number of computational approaches for
protein interaction discovery have been developed over recent years. These methods differ in the
feature information used for protein interaction prediction. Many studies have shown that
familiarity with the available tools and databases is important for conducting new research in
protein-protein interaction analysis [1-7]. This review focuses on describing these resources.

2. DATABASES 

Through recent rapid advances in high-throughput technologies, massive protein-protein interaction
data for various organisms have become available and are currently stored in several databases. More
than 100 PPI-related repositories have been published and are available online [8]; these databases
can be used as the main source of data for evaluating prediction methods. Many of these PPI data
providers are independently funded and work in isolation, and they often contain redundant data from
overlapping sets of publications. Efforts to integrate data from disparate PPI repositories began
with the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) and the
International Molecular Exchange (IMEx) consortium, followed by the publication of the 'minimum
information about a molecular interaction experiment' (MIMIx) guidelines [9]. The HUPO-PSI has
developed the PSI-MI XML format to establish a single, unified format for PPI data. Additionally, a
simplified tabular format, MITAB, has been developed [10].
The IMEx is an international collaboration between a group of major public interaction data
providers who have agreed to share literature-curation efforts and make a nonredundant set of PPIs
available in a single search interface on a common website (http://www.imexconsortium.org/) [8].
IMEx defines three types of membership:

Active: an IMEx partner that commits to producing relevant numbers of records curated to the IMEx
standard and providing these via a Proteomics Standards Initiative common query interface (PSICQUIC)
service.

Observer: a prospective IMEx consortium member.

Inactive: a former IMEx partner that contributed to the establishment of the IMEx curation rules.

The PSICQUIC is a web service aimed at enabling users to access multiple interaction databases with a
single query and at standardizing programmatic access to molecular interaction databases [11].
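
To make the tabular MITAB format introduced above more concrete, the following minimal sketch splits one tab-separated record into named fields. It assumes the 15-column layout of MITAB 2.5; the column names, the helper `parse_mitab_line` and the example record are our own illustrative choices and are not taken from any database. Quoted values that themselves contain a '|' character are not handled.

```python
# Hypothetical helper: split one MITAB 2.5 record into named fields.
# The column order is assumed from the 15-column MITAB 2.5 layout.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_method", "first_author", "publication_id",
    "taxid_a", "taxid_b", "interaction_type", "source_db",
    "interaction_id", "confidence",
]


def parse_mitab_line(line):
    """Return a dict mapping each column name to a list of values.

    Columns are separated by tabs, multiple values inside a column by '|',
    and a lone '-' marks an empty column.
    """
    cells = line.rstrip("\n").split("\t")
    if len(cells) < len(MITAB25_COLUMNS):
        raise ValueError("expected at least 15 tab-separated columns")
    record = {}
    for name, cell in zip(MITAB25_COLUMNS, cells):
        record[name] = [] if cell == "-" else cell.split("|")
    return record


# Invented example record, for illustration only.
example = "\t".join([
    "uniprotkb:EXAMPLE_A", "uniprotkb:EXAMPLE_B", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Doe et al. (2010)", "pubmed:00000000",
    "taxid:9606", "taxid:9606", 'psi-mi:"MI:0915"(physical association)',
    "example-db", "example:1", "-",
])
rec = parse_mitab_line(example)
print(rec["id_a"], rec["id_b"], rec["detection_method"])
```

Each parsed record corresponds to one binary interaction, which is the form in which most of the repositories described below distribute their data.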

Supplementary Table S1 lists almost all active PPI databases. In the following we describe in more
detail the most popular repositories (those with more than 1000 citations according to Google
Scholar at the time of writing this manuscript, May 2013); see Table 1 for a feature-by-feature
comparison of these databases.

BioGRID [12, 13], the Biological General Repository for Interaction Datasets, is one of the most
comprehensive databases of experimentally determined protein-protein interactions. It is a
continuously updated source of protein and genetic interactions from major model organisms (at the
time of writing this manuscript, Feb-2012, it contains 27 different organisms), compiled through
comprehensive curation efforts. It comprises more than 460,000 interactions, and all interaction
data are freely available for download in a wide variety of standardized formats. Moreover, this
repository supplies information about the experimental methods used for interaction detection. The
database does not contain information about multi-protein complexes larger than dimers and lists
every interaction as a pairwise interaction.

The Database of Interacting Proteins (DIP) [14], developed at the University of California, Los
Angeles, combines data from a variety of sources to create a single, consistent set of PPIs. In
addition to the primary sources, DIP derives its data from a number of other databases such as the
Yeast Protein Database (YPD) [15], EcoCyc [16], FlyNet [17], and the Kyoto Encyclopedia of Genes and
Genomes (KEGG) [18]. The complete DIP dataset is freely available for download, as are specialized
DIP subsets and additional data (free registration required). The database covers more than 460
organisms.

The Biomolecular Interaction Network Database (BIND) [19, 20] is a component of BOND (the
Biomolecular Object Network Databank). This repository was created at the University of Toronto. It
contains more than 200,000 interactions from more than 1500 organisms and holds a large variety of
interaction data, including those curated by a team of curators. Although the majority of BIND
consists of protein interaction data, BIND also contains many other types of interactions involving
RNA, DNA, genes, complexes and small molecules. BIND curation stopped in 2005, but BIND remains a
highly cited, publicly available interaction database. Because the BIND data are not available in a
standard format from the official source, a translation of BIND into the Proteomics Standards
Initiative-MI (PSI-MI) 2.0 format was recently made publicly available [21], which makes the BIND
data compatible with current software tools.

The Molecular Interaction Database (MINT) [22, 23] was developed by the University of Rome Tor
Vergata. Interaction data and various experimental details are mined from the published literature
using a literature-mining program, the MINT assistant; expert curators then establish the putative
interactions. MINT currently contains more than 230,000 interactions and more than 34,000 proteins
and is focused on the model organisms. This database provides confidence scores for experimentally
detected PPIs, which indicate the reliability of the interactions. MINT is an active partner of
IMEx, shares curation efforts, and supports the Proteomics Standards Initiative (PSI)
recommendations.

The Human Protein Reference Database (HPRD) [24-26] was built as a cooperative effort between Johns
Hopkins University and the Institute of Bioinformatics. This resource provides a collection of human
protein-protein interactions. Data are manually extracted from the literature, and each record is
linked to detailed information for each protein in the human proteome, including post-translational
modifications, disease associations via OMIM, subcellular localizations, enzyme-substrate
relationships, protein isoforms and domain architectures. This database currently contains more than
30,000 proteins and more than 39,000 protein-protein interactions.

IntAct [27-29] is a molecular interaction database whose data come from the literature or from
direct data depositions. IntAct source code and data are freely available for download. Currently
this resource contains more than 60,000 proteins and more than 290,000 binary interaction evidences
abstracted from more than 5000 scientific publications. IntAct is an active partner of the IMEx
consortium, and the majority of its protein-protein interaction data is annotated to IMEx standards.
In addition to protein-protein interaction data, IntAct also includes information on DNA, RNA, and
small-molecule interactions.

3. COMPUTATIONAL METHODS FOR PROTEIN- 
PROTEIN INTERACTION PREDICTION 

In general, the available methods for predicting protein- 
protein interaction can be divided into four main categories: 
methods based on genomic context and structural informa- 
tion, methods that use network topology to predict protein- 
protein interaction, methods that detect protein-protein inter- 
action by using text mining and literature mining (or data- 
base search) and, finally, methods based on machine learning 
algorithms utilizing heterogeneous genomic/proteomic fea- 
tures (see Table 2 for a general overview). In the following 
section we describe each of these methods and their applica- 
tion. 

3.1. Methods Based on Genomic Context and Structure 
Information 

3.1.1. Gene Neighboring

Gene neighboring, or co-localization of genes, is one of the first and simplest protein-protein
interaction prediction methods based on genomic context [45-47]. The main idea is that related genes
are located close to one another in the genome (Fig. 1). As with many other genome-context
approaches, the predictions of this method become more confident with larger numbers of genomes
[48]. Unlike in prokaryotic organisms, related genes in eukaryotes show no evident tendency to be
located at a close genomic distance, so a major limitation of this method is that it is not reliably
applicable to eukaryotic genomes, especially when there are no homologues in prokaryotes. While its
simplicity is a benefit, this method may produce some false negative results because it fails to
recognize interactions between related but distantly located genes. Another drawback of the gene
neighboring method is that the choice of reference genomes can affect its performance [49].
Table 1. The most popular PPI repositories (those with more than 1000 citations according to Google Scholar at the time of writing this manuscript, May 2013). For a detailed description of these repositories, refer to the text.

Acronym | Organisms | Number of Interactions | Curated to IMEx/MIMIx standards | IMEx Partner | Accepts submission in PSI-MI format | Availability | PSICQUIC service | Last Content Update | URL | References
--------------------------------------------------------------------------------
BioGRID (IMEx partner) | Model organisms | 279,409 | IMEx | Yes (Observer) | No | Free to academic users | Yes | 2013 | http://www.thebiogrid.org | [30-32]
BIND/BOND (IMEx partner) | H. sapiens, S. cerevisiae, M. musculus, H. pylori | 198,905 | No | Yes (Inactive) | Yes (submission by email) | Free to all users | Yes | 2004 | http://download.baderlab.org/BINDTranslation/ | [21, 33-35]
DIP/LiveDIP (IMEx partner) | Model organisms | 73,268 | IMEx | Yes (Active) | Yes (submission by email) | Free to academic users | Yes | 2005 | http://dip.doe-mbi.ucla.edu | [36-40]
HPRD | Human | 30,047 | No | No | Yes (submission by email) | Free to academic users | No | 2009 | http://www.hprd.org | [24-26, 41]
IntAct (IMEx partner) | Model organisms | 290,891 | IMEx/MIMIx | Yes (Active) | Yes (submission by web-based tool) | Free to all users | Yes | 2011 | http://www.ebi.ac.uk/intact/ | [28, 29, 42, 43]
MINT (IMEx partner) | Model organisms | 241,458 | IMEx | Yes (Active) | Yes (submission by email) | Free to all users | Yes | 2011 | http://mint.bio.uniroma2.it/mint | [22, 23, 43, 44]


Table 2. A general overview of computational methods for protein-protein interaction prediction, with their references. For a description of the methods, refer to the text.

Method | Features | References
--------------------------------------------------------------------------------
Methods based on genomic context and structure information:
  Gene fusion | Usually used for small-scale proteomes; not generally applicable to all genes; fusion events are not abundant, especially in prokaryotes; very reliable. | [54, 55]
  Gene neighboring | Usually used for small-scale proteomes; relatively simple; prone to produce false negatives; results depend on the number and distribution of the genomes used. | [46, 47, 123]
  Phylogenetic similarity | Needs complete genomes; results depend on the number and distribution of the genomes used; cannot be applied to essential proteins. | [50-52, 124-126]
  Sequence and primary structure | Relatively simple; can be used for large-scale proteomes; the importance of the features needs to be interpreted. | [65-74]
  Structure based | Tends to be more limited in terms of scale; allows a detailed analysis of PPIs. | [56-64, 127-129]
Methods based on machine learning algorithms utilizing multiple genomic/proteomic features:
  Decision tree and random forest | Copes well with high-dimensional data; copes well with missing values; the pattern in the data can be easily explained. | [73, 121, 122, 130, 131]
  KNN | Simple to understand; requires no training; the computational cost and memory requirement grow rapidly with increasing feature vector dimension. | [118]
  MLP | Good generalization capabilities; behaves like a black box. | [110, 111]
  Naive Bayes | Assumes independence between features; simple and easy to interpret; copes well with missing values. | [71, 74, 115-117]
  SVM | Copes well with high-dimensional data; very powerful. | [70, 104-106]
Other methods:
  Using network topology for predicting protein-protein interaction | Results can be affected by false positives and network completeness. | [78-85]
  Text mining methods | Results may not be as reliable as manually curated data, but the fast growth of published biomedical literature can make these methods more confident. | [90-97, 99, 100]

3.1.2. Phylogenetic Relationship 

In this method, protein interactions are detected based on the similarity of "phylogenetic
profiles" [50-52]. The phylogenetic profile of a given protein is a binary vector that reflects the
presence or absence of that protein across a set of organisms (Fig. 2). This method is a more
flexible version of the gene neighboring method and can detect some interactions that the gene
neighboring method fails to detect. The basic idea is that functionally related genes are retained
together across many distant species because they play a role in the same biological process.
However, this powerful method has three important drawbacks. First, the number and distribution of
the genomes used can influence the results dramatically [49]. Second, it cannot be applied to
essential proteins that are present in almost all organisms. Third, the method can only be run on
complete genomes [48].
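
The profile comparison described above can be sketched in a few lines. The sketch below is only an illustration under our own assumptions: the similarity measure (Jaccard), the threshold of 0.8, the function names and the toy profiles are all hypothetical choices, and published methods use a variety of other measures.

```python
def profile_similarity(p, q):
    """Jaccard similarity between two binary presence/absence profiles."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same organisms")
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def predict_pairs(profiles, threshold=0.8):
    """Return (protein, protein, similarity) for sufficiently similar profiles.

    Proteins present in every organism (or in none) are skipped because
    their profiles carry no signal; this mirrors the limitation for
    essential proteins noted in the text.
    """
    informative = {
        name: p for name, p in profiles.items() if 0 < sum(p) < len(p)
    }
    names = sorted(informative)
    hits = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            s = profile_similarity(informative[a], informative[b])
            if s >= threshold:
                hits.append((a, b, s))
    return hits


# Toy profiles over five organisms (1 = present, 0 = absent).
profiles = {
    "P1": [1, 0, 1, 1, 0],
    "P2": [1, 0, 1, 1, 0],
    "P3": [0, 1, 0, 0, 1],
    "P4": [1, 1, 1, 1, 1],  # present everywhere, hence uninformative
}
print(predict_pairs(profiles))
```

In this toy example only P1 and P2 share a profile, so they form the single predicted pair, while P4 is excluded because it is present in all organisms.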

3.1.3. Gene Fusion

It has been observed that separate related genes can be fused into a single multi-functional gene, a
so-called "Rosetta Stone" protein, probably to reduce the regulatory load of multiple interacting
gene products. For example, topoisomerase II is a fusion of the Gyr A and Gyr B subunits of
Escherichia coli DNA gyrase [53]. The gene fusion method uses comparative genomics and evolutionary
information [54, 55] and so can be considered a complement to the gene neighboring and phylogenetic
profile methods (Fig. 3). A major advantage of this method is its reliability, because existing gene
fusion events are very informative about functional relationships. One drawback of this method is
that fusion events are not abundant, especially in prokaryotes [48].

Fig. (1). Gene neighboring method for protein-protein interaction prediction. The main idea is that
related genes are located close to one another in the genome. For example, the black and blue
proteins are predicted to interact (the plus sign indicates an interaction between proteins). The
method can produce false negatives because it fails to recognize interactions between related but
distantly located genes, and its performance depends on the choice of reference genomes [49].
Fig. (2). Protein-protein interaction detection based on phylogenetic profile similarity. The
phylogenetic profile of a protein is a binary vector that reflects the presence or absence of that
protein across a set of organisms (the plus sign indicates an interaction between proteins).

Fig. (3). Gene fusion method for protein-protein interaction detection. In this method, complete
genomes must be compared across multiple organisms: if two separate proteins in one organism are
fused in another organism, one can conclude that they interact (the plus sign indicates an
interaction between proteins).
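
The gene fusion idea illustrated in Fig. 3 can be sketched as follows. This is a simplified illustration, not a published algorithm: we assume that sequence alignments have already been computed elsewhere, and the input layout, the function name `rosetta_stone_pairs` and the toy coordinates are hypothetical.

```python
from itertools import combinations


def rosetta_stone_pairs(hits, max_overlap=0):
    """Predict interacting pairs from alignments to candidate fusion proteins.

    `hits` maps a candidate fusion protein to a list of
    (query_protein, start, end) alignments on that protein. Two different
    query proteins that align to the same fusion protein in regions
    overlapping by at most `max_overlap` residues are reported as a pair.
    """
    pairs = set()
    for fusion, alignments in hits.items():
        for (qa, sa, ea), (qb, sb, eb) in combinations(alignments, 2):
            if qa == qb:
                continue
            overlap = min(ea, eb) - max(sa, sb) + 1
            if overlap <= max_overlap:
                pairs.add((min(qa, qb), max(qa, qb), fusion))
    return sorted(pairs)


# Toy alignments: geneA and geneB cover separate halves of fusedAB,
# whereas geneC and geneD align to the same region of fusedX.
hits = {
    "fusedAB": [("geneA", 1, 180), ("geneB", 200, 410)],
    "fusedX": [("geneC", 1, 150), ("geneD", 20, 160)],
}
print(rosetta_stone_pairs(hits))
```

Requiring non-overlapping alignments is one simple way to separate genuine fusions from cases in which two homologous proteins simply match the same domain.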

3.1.4. 3D Structure Based 

This method uses three-dimensional (3D) structure information to predict interactions. Information
from predicted interactions can also be used to predict interactions between new proteins that are
homologous to previously predicted interacting proteins [56-63]. The method needs accurate
three-dimensional structures (Fig. 4) and so cannot be used for a large number of proteins, because
the number of known 3D structures in the Protein Data Bank (PDB) is limited. Recently, a genome-wide
method for PPI prediction based on structural information was reported that uses homology models
when three-dimensional structures are absent [64]. Compared with other methods, the results of this
method are more detailed, including the interacting residues and the biophysical characteristics of
the interaction.




Fig. (4). Using three-dimensional (3D) structure information to predict interactions. This method
needs accurate three-dimensional structures.

3.1.5. Primary Structure 

Information derived from protein sequences has been used in a number of bioinformatics studies, such
as protein subcellular localization and protein recognition; in recent years, sequence information
has also been used for protein-protein interaction prediction [65-74]. Primary-structure approaches
typically predict protein-protein interactions based on short conserved polypeptide sequences such
as signatures [66, 67, 69], or on sequence similarity and k-let counts (counts of subsequences of
length k) [70, 74-77].

3.2. Methods Based on Network Topology 

Like in many real-world networks, protein-protein inter- 
action networks in various organisms share common topo- 
logical features which make these networks different from 
random networks. These topological features have been used 
as evidence to discern the difference between interactions 
that represent true positives and those that are false positives. 
These have allowed researchers to assign an improved confi- 
dence score to each interaction [78]. 

Analyzing PPI networks from a topological perspective is crucial for a better understanding of the
underlying evolutionary mechanisms and network dynamics that shape the network. Because network
theory is a relatively new field, the significance of topological properties in a given PPI network
is determined by comparing them against those in random networks, and confidence scores are then
assigned to PPIs. Finally, based on these scores, some interactions can be eliminated and others can
be added to the network [79-85]. One problem is how to create random networks for comparison;
usually the number of vertices and edges is held constant so that we can determine which properties
are significant.

A random graph model generates graphs by a random process and draws on graph theory and probability
theory. Random graphs were first defined in 1959 in two independent studies [86, 87]. The
Erdős-Rényi model is one model for random graph generation. In this model, the number of vertices in
the random network equals the number of vertices in the original protein-protein interaction
network, and an edge exists between any two vertices with probability equal to the edge density,
independently of other edges; roughly speaking, this model therefore generates a network with the
same number of vertices and, on average, the same number of edges. However, this model is not a
suitable random model for determining the significance of protein-protein interaction network
properties, because many topological properties of the networks it generates differ from those of
protein-protein interaction networks.
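
As a minimal sketch of the random model described above, the code below samples an Erdős-Rényi graph whose edge probability equals the edge density of a small example network. The toy edge list, the function names and the fixed random seed are our own illustrative assumptions.

```python
import random
from itertools import combinations


def edge_density(n, e):
    """Ratio of existing edges to the n(n-1)/2 possible undirected edges."""
    return 2.0 * e / (n * (n - 1)) if n > 1 else 0.0


def erdos_renyi(vertices, p, rng):
    """Sample G(n, p): include each possible edge independently with probability p."""
    return {
        frozenset(pair)
        for pair in combinations(vertices, 2)
        if rng.random() < p
    }


# Toy "observed" network with 5 proteins and 5 interactions.
ppi_edges = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")}
vertices = sorted({v for edge in ppi_edges for v in edge})
p = edge_density(len(vertices), len(ppi_edges))

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
random_edges = erdos_renyi(vertices, p, rng)
print(p, len(random_edges))
```

The sampled network has the same vertices as the observed one, but its number of edges matches only in expectation, which is why the text describes the agreement as rough.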

Some of the most important topological concepts of protein-protein interaction networks are as
follows. In this section we consider the protein-protein interaction network as a graph G = (V, E),
in which V and E are the vertex set and edge set of the graph, respectively; we denote the number of
vertices by n and the number of edges by e.

Degree: the degree of a vertex v is the number of its interactions and is denoted deg(v).

Degree Distribution: indicates the number of vertices with each degree.

Hub: a node with high degree is called a hub; hubs are suggested to have an important role in
cellular processes.

k-core: a subgraph of the network in which every vertex has degree greater than k-1 within that
subgraph.

Edge Density: the ratio of the number of network edges to the maximum possible number of edges,
which can be computed as in equation 1:

\[ ED = \frac{2e}{n(n-1)} \tag{1} \]

Clustering Coefficient (CC): for a vertex v, let E_v denote the maximum number of possible
interactions between the neighbors of v, and let F_v denote the number of interactions that actually
exist between the neighbors of v. The clustering coefficient of v is then defined as follows:

\[ cc(v) = \frac{F_v}{E_v} \tag{2} \]

For a network, the clustering coefficient is defined as the average clustering coefficient over all
vertices:

\[ CC = \frac{1}{n} \sum_{v \in V} cc(v) \tag{3} \]



Path length: the least number of edges needed to reach one vertex from another is called the path
length between them.

Average path length: the average path length over all possible vertex pairs in the network.

Diameter: the maximum path length in the network.
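
The quantities defined so far (degree distribution, edge density, clustering coefficient, k-core, path length and diameter) can be computed directly from an adjacency structure. The following sketch assumes an undirected, connected toy network; all function names are our own, and the convention of assigning a clustering coefficient of zero to vertices with fewer than two neighbors is an assumption rather than part of the definitions above.

```python
from collections import Counter, deque


def build_graph(edges):
    """Adjacency sets for an undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """Map each degree to the number of vertices having it."""
    return Counter(len(nbrs) for nbrs in adj.values())


def edge_density(adj):
    n = len(adj)
    e = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2.0 * e / (n * (n - 1)) if n > 1 else 0.0


def clustering(adj, v):
    """Fraction of neighbor pairs of v that are themselves connected."""
    nbrs = sorted(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for vertices with < 2 neighbors
    linked = sum(
        1 for i in range(k) for j in range(i + 1, k) if nbrs[j] in adj[nbrs[i]]
    )
    return linked / (k * (k - 1) / 2)


def k_core(adj, k):
    """Vertices left after repeatedly removing those with fewer than k neighbors."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if sum(1 for w in adj[v] if w in alive) < k:
                alive.discard(v)
                changed = True
    return alive


def bfs_distances(adj, source):
    """Path length from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def path_statistics(adj):
    """Average path length and diameter over all reachable ordered pairs."""
    lengths = [
        d for s in adj for t, d in bfs_distances(adj, s).items() if t != s
    ]
    return sum(lengths) / len(lengths), max(lengths)


adj = build_graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])
average_cc = sum(clustering(adj, v) for v in adj) / len(adj)
print(dict(degree_distribution(adj)), edge_density(adj), average_cc)
print(sorted(k_core(adj, 2)), path_statistics(adj))
```

In the toy network, the triangle A-B-C forms the 2-core, and the longest shortest path (from A or B to E) has three edges.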

Centrality: in general, centrality is a structural attribute of nodes or edges that shows the
importance of those nodes or edges in the network. There are many centrality measures; we describe
three of the most popular ones below:

Degree centrality: the simplest centrality measure, defined as the number of edges incident to a
node; in the normalized version the degree is divided by the maximum possible degree.

Closeness centrality: this measure is based on the distance of a node to all other nodes in the
graph and is precisely defined as follows:

\[ C_C(v) = \frac{1}{\sum_{u \in V} \mathrm{dist}(u, v)} \tag{4} \]

Betweenness centrality: the betweenness centrality of a node v is defined as the number of shortest
paths passing through v; this measure can be defined for an edge in the same way. A protein with a
high betweenness centrality value has great influence over information flow in the whole network.
Unlike closeness centrality, betweenness centrality can be used for disconnected networks.
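
The three centrality measures can be sketched as follows for an unweighted, undirected toy network. The betweenness computation uses Brandes' algorithm, which is our choice and not one named in the text; it also counts shortest paths fractionally when several equally short paths connect a pair, a common refinement of the simple count given above. All function names are illustrative.

```python
from collections import deque


def degree_centrality(adj, v):
    """Degree of v divided by the maximum possible degree, n - 1."""
    return len(adj[v]) / (len(adj) - 1)


def closeness_centrality(adj, v):
    """1 / (sum of path lengths from v); assumes a connected graph."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return 1.0 / sum(dist.values())


def betweenness_centrality(adj):
    """Brandes' algorithm for unweighted, undirected graphs."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                       # vertices in order of discovery
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        sigma = {v: 0 for v in adj}      # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was visited from both endpoints.
    return {v: b / 2.0 for v, b in score.items()}


adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
print(degree_centrality(adj, "C"), closeness_centrality(adj, "C"))
print(betweenness_centrality(adj))
```

In this example protein C lies on every shortest path between the triangle A-B-C and the tail D-E, so it receives the highest betweenness value.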

Motif: a subgraph of the network whose number of occurrences is significantly high (more than
expected at random).

Interestingly, protein-protein interaction networks of different species have many topological
features in common. PPI networks are said to have a power-law degree distribution: there are a few
nodes with many connections and many nodes with few connections, so the degree distribution is
heavy-tailed [88, 89]. Another feature of protein-protein interaction networks compared to random
networks is a high clustering coefficient: the probability that the neighbors of two interacting
proteins also interact is significantly high. Despite the high clustering, the average path length
in PPI networks is short. Protein-protein interaction networks are referred to as small-world
networks because they have a high clustering coefficient and a short average path length.

In addition to predicting protein-protein interactions, topological properties of protein-protein
interaction networks have been used to predict protein function and to find protein complexes and
functional modules.

3.3. Methods Based on Text Mining and Literature Mining

PubMed is expanding at a rate of approximately one paper every thirty seconds, which shows the
importance of biomedical literature mining approaches. Some methods use text mining and literature
mining algorithms and exploit the co-occurrence of proteins in PubMed abstracts for protein-protein
interaction prediction [90-100]. In general, each literature mining system consists of three steps
(Fig. 5):

Named Entity Recognition (NER) step: identifies protein names, which is a crucial step for further
analysis.

Zoning step: splits the text into basic building blocks and extracts sentences from the text.

Protein-protein interaction extraction step: uses various algorithms to infer protein-protein
interactions.

Current biomedical literature mining approaches for detecting protein-protein interactions can be
divided into three categories:

Computational natural language processing (NLP) and linguistics-based methods, which define a
grammar and use parsers to detect protein-protein interactions.

Rule-based methods, which infer protein-protein interactions using a set of context-specific rules
or patterns.

Machine learning approaches, which do not need rules or a grammar; instead, classifiers learn from a
training set the patterns that enable them to identify protein-protein interactions.

These automated data mining results may not be as reliable as manually curated data, but the fast
growth of published biomedical literature can make these methods more confident. Supplementary
Table 2 lists literature and text mining tools for protein-protein interaction.
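
The three steps of a literature mining system can be illustrated with a deliberately naive co-occurrence sketch. Everything here is an assumption made for illustration: the four-name protein lexicon, the regular expressions and the example text are toy choices, and for convenience zoning is run before name recognition. Real systems use far more elaborate components for each step.

```python
import re
from collections import Counter
from itertools import combinations

# Toy dictionary of protein names (hypothetical lexicon for this sketch).
PROTEIN_LEXICON = {"RAD51", "BRCA2", "TP53", "MDM2"}


def split_sentences(text):
    """Zoning: a naive split after sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_proteins(sentence):
    """NER: dictionary lookup of protein names among the tokens."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return sorted({t for t in tokens if t in PROTEIN_LEXICON})


def cooccurrence_counts(text):
    """Interaction extraction: count sentences in which two proteins co-occur."""
    counts = Counter()
    for sentence in split_sentences(text):
        for pair in combinations(find_proteins(sentence), 2):
            counts[pair] += 1
    return counts


abstract = (
    "BRCA2 binds RAD51 during homologous recombination. "
    "MDM2 ubiquitinates TP53. RAD51 foci require BRCA2."
)
print(cooccurrence_counts(abstract))
```

Pairs that co-occur in many sentences receive higher counts, which can serve as a crude confidence score; co-occurrence alone cannot distinguish an interaction from any other kind of association.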
3.4. Methods Based on Machine Learning Algorithms Utilizing Heterogeneous Genomic/Proteomic Features

Some other methods use heterogeneous biological data such as gene expression, codon usage [71, 101],
k-let counts (counts of subsequences of length k) [70, 74-77] and physicochemical properties of
amino acids [77, 102] to learn a model for predicting PPIs. These methods integrate biological data
sources provided by high-throughput technologies into feature vectors and use machine learning
approaches to learn and predict PPIs from these feature vectors. (Some of these methods were
mentioned previously under other headings, especially as sequence-based methods; here we focus on
the machine learning algorithms used in those studies.) Generally speaking, a machine learning
algorithm (classifier) for protein-protein interaction prediction uses a set of features (or
descriptors) of proteins or protein pairs with known interaction or non-interaction status as a
learning set to learn which proteins interact and which do not; the algorithm can then classify new
protein pairs into interacting or non-interacting classes. Several machine learning algorithms have
been used to address the protein-protein interaction prediction problem, and we briefly describe
them below.
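
As an illustration of how sequence information can be turned into a feature vector, the sketch below counts k-lets over the 20 standard amino acids and concatenates the vectors of the two proteins in a pair. The concatenation scheme, the function names and the two short sequences are our own assumptions; the cited studies use a range of encodings.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def klet_counts(sequence, k=2):
    """Count every length-k subsequence (k-let) over the 20 standard amino acids."""
    index = {
        "".join(letters): i
        for i, letters in enumerate(product(AMINO_ACIDS, repeat=k))
    }
    vector = [0] * len(index)
    for i in range(len(sequence) - k + 1):
        klet = sequence[i:i + k]
        if klet in index:  # k-lets with non-standard symbols are skipped
            vector[index[klet]] += 1
    return vector


def pair_features(seq_a, seq_b, k=2):
    """Concatenate the k-let vectors of two proteins into one feature vector."""
    return klet_counts(seq_a, k) + klet_counts(seq_b, k)


# Two made-up sequences, for illustration only.
features = pair_features("MKVLAAGMKV", "GAVLMKV")
print(len(features), sum(features))
```

Note that simple concatenation makes the vector depend on the order of the two proteins; a symmetric encoding would be needed if the pair (A, B) should be treated the same as (B, A).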

Support vector machines (SVMs), or kernel machines, are widely used in bioinformatics and
computational biology for classifying biological data [103], including protein-protein interaction
prediction [70, 104-106]. The SVM classifier is underpinned by the idea of maximizing the margins.
Intuitively, the margin for an object is related to the certainty of its classification (see
Fig. 6). Objects for which the assigned label is correct and highly certain will have large margins,
and objects with uncertain classification are likely to have small margins [107]. An SVM can be
trained using a labeled training dataset, in which each item is marked as belonging to one of two
classes, to build a model that can predict class labels for new examples. SVM is extremely powerful
and can classify problems with arbitrary complexity, but it is complex, has large memory
requirements, and is somewhat slow to train and evaluate. Another drawback of this classifier is
that the parameters can greatly impact the results [103]. For more details, we refer the interested
reader to [108, 109].
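
The margin idea can be illustrated with a very small linear SVM. The sketch below minimizes the regularized hinge loss by subgradient descent; it uses no kernel function, and the learning rate, regularization strength, number of epochs and two-dimensional toy data are arbitrary assumptions. It is not the training procedure of any SVM package used in the cited studies.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimize the L2-regularized hinge loss by subgradient descent.

    `labels` must be +1 or -1. Returns the weight vector and the bias.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: move toward the sample.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Confidently correct: only the regularizer acts.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def svm_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy feature vectors for protein pairs: +1 = interacting, -1 = non-interacting.
samples = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(samples, labels)
print(w, b, [svm_predict(w, b, x) for x in samples])
```

Only samples with a margin below 1 trigger an update toward the data, which reflects the fact that an SVM solution is determined by the objects closest to the decision boundary.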

Current Genomics, 2013, Vol. 14, No. 6 403

Artificial neural networks (ANNs or simply NNs) originated from the idea of mathematically modeling
human intellectual abilities by biologically plausible engineering designs. One of the most popular
NN models is the multi-layer perceptron (MLP) [107]. MLP is a tool used for modeling PPIs with good
performance [110, 111]. However, MLP has been criticised as being a black-box classifier because it
is difficult to know what the model parameters mean [112]. An MLP is a feedforward artificial neural
network that consists of multiple layers, each fully connected to the next layer by weighted edges.
Typically there are three layers: an input layer, a hidden (intermediate) layer and an output layer.
Each node in the hidden and output layers is a neuron with an activation function; these nodes are
the processing units of an MLP. The weights of the edges are optimized on the training dataset to
minimize classification error using a supervised learning approach. Fig. 7 shows a schematic
representation of an MLP for protein-protein interaction prediction. For more details, we refer the
interested reader to [113].
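
The layered computation of an MLP can be sketched as a forward pass. In the sketch below the weights are hand-picked rather than trained, the two binary inputs stand for hypothetical features of a protein pair, and the sigmoid activation is our choice; training by supervised learning is omitted entirely.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, layers):
    """Propagate input `x` through fully connected layers.

    `layers` is a list of (weights, biases); weights[j] holds the incoming
    edge weights of neuron j in that layer.
    """
    activation = list(x)
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(neuron_w, activation)) + bias)
            for neuron_w, bias in zip(weights, biases)
        ]
    return activation


# Hand-picked (not trained) weights: the output is high when exactly one of
# the two binary inputs is set, a pattern no single neuron can represent.
layers = [
    ([[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0]),  # hidden layer, 2 neurons
    ([[20.0, 20.0]], [-30.0]),                        # output layer, 1 neuron
]
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(forward(x, layers)[0], 3))
```

The example also hints at why an MLP is regarded as a black box: the individual weights have no direct biological interpretation even in this tiny network.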

Naive Bayes is a probabilistic classifier based on Bayes'
theorem. It is popular owing to its simplicity (which stems
from the assumption that the features are statistically
independent given the class), its computational efficiency and
its ease of interpretation. In spite of its simplicity, Naive
Bayes works quite well in problems involving normal
distributions, which are very common in real-world problems.
Naive Bayes classifiers can be trained efficiently on a small
training dataset in a supervised learning approach using
maximum likelihood, but they may not work well in more complex
classification problems [114]. This method has been widely
used for PPI prediction [71, 74, 115-117].
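
As an illustration of how the independence assumption simplifies training, the following minimal sketch fits a Gaussian Naive Bayes model by maximum likelihood and classifies a new example. The Gaussian form of the per-feature likelihood, the variance floor and the toy feature values are assumptions made for this example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate, per class, a prior plus a mean and variance for each feature."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for y, rows in grouped.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Maximum likelihood variance, with a small floor (an assumption)
        # so that a constant feature does not cause division by zero.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, 1e-9)
            for col, m in zip(zip(*rows), means)
        ]
        model[y] = (n / len(samples), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Pick the class with the highest log posterior under feature independence."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, var in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda y: log_posterior(*model[y]))

# Hypothetical two-feature descriptions of protein pairs (1 = interacting).
samples = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.7], [0.1, 0.2], [0.2, 0.1], [0.15, 0.3]]
labels = [1, 1, 1, 0, 0, 0]
nb_model = fit_gaussian_nb(samples, labels)
```

Because each feature contributes an independent term to the log posterior, training only requires per-class means and variances, which is why a small training set can suffice.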

K-nearest neighbors (K-NN) is one of the simplest machine
learning classifiers. It is a prototype method that assigns a
label to each object by majority vote among the K closest
objects in the feature space (the parameter K must be set by
the user). In contrast to other statistical methods, K-NN
requires no explicit training (because the choice of K is
crucial in this method, optimizing K can be considered a kind
of learning). Although it is simple to implement, its
computational cost and memory requirements grow rapidly when a
large dataset or numerous features are used. This method has
been used, though not widely, for PPI prediction [118]. For
more details, we refer the interested reader to [119].
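
The majority vote among the K closest objects can be written in a few lines. This is a sketch under stated assumptions: Euclidean distance is used as the closeness measure, and the training points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    # Every training example is visited, which is why cost and memory
    # grow with the size of the dataset and the number of features.
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors for protein pairs ('+' interacting, '-' not).
train_x = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
train_y = ["+", "+", "+", "-", "-", "-"]
```

Calling `knn_predict(train_x, train_y, [0.75, 0.8])` returns the label shared by the three nearest points; no model is built beforehand, which reflects the absence of an explicit training phase.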
Fig. (5). Schematic of text mining approaches for protein-protein interaction prediction. In general, each literature mining system consists
of three steps: a Named Entity Recognition (NER) step, which identifies protein names; a zoning step, in which the text is split into
basic building blocks and sentences are extracted; and a PPI step, which uses various algorithms to infer protein-protein interactions.
Fig. (6). Schematic view of the SVM classifier. In the first step,
a function called the "kernel function" (denoted by Φ) transforms
the protein pairs into points in a new space (presumably making
the classification easier in this space). Then, the best
separating hyperplane (a separating line in this figure) is
selected as the boundary between the two classes; in this example
each of the three thin black lines and the thick green line is a
separating line. The margin of a separating hyperplane is the
shortest distance from that hyperplane to the closest positive or
negative example; in this figure the margin of the thick green
line is denoted by 'w'. The best separating hyperplane is the one
with the maximum margin; in this example the thick green line is
the best separating line (the points closest to the best
separating hyperplane are called support vectors and are circled
in the figure).







Fig. (7). A multilayer perceptron (MLP), which consists of
multiple layers, each fully connected to the next layer by
weighted edges. Typically there are three layers: an input layer,
a hidden (intermediate) layer and an output layer. The features
(abbreviated as ftr in the figure) of each protein pair are
delivered to the input layer for classification. Each node in the
hidden and output layers is a neuron with an activation function;
these nodes are the processing units of the MLP. The neuron in the
output layer classifies the input protein pair into the
interacting class (denoted by a '+' sign between two proteins) or
the non-interacting class (denoted by a '-' sign between two
proteins). The weights of the edges are adjusted on the training
dataset to minimize the classification error using a supervised
learning approach.

Decision tree or classification tree is a popular machine
learning classifier that has many applications in
bioinformatics and computational biology and has been shown to
be one of the best classifiers for protein-protein interaction
prediction. In these trees, internal nodes test features, each
branch corresponds to a feature value and each leaf assigns a
class label (Fig. 8). In the training phase, the training
dataset is partitioned into subsets according to the feature
values, and this process is repeated recursively on the subsets
until further splitting has no effect on the classification.
Under several notions of optimality, constructing the optimal
decision tree is an NP-complete problem, so practical decision
tree construction algorithms such as ID3, C4.5 and CART employ
a heuristic search [114]. In addition, the classifier is
efficient in terms of computational cost and memory
requirements. It is prone to overfitting and in some
applications may not generalize well, but compared with other
classifiers such as the MLP, the patterns in the data can be
easily explained with classification trees [112].
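
The recursive partitioning and the greedy (heuristic) choice of each split can be sketched as follows. The Gini impurity criterion is the one used by CART; the threshold tests on numeric features and the toy data are assumptions for illustration, and no pruning is performed.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Greedy search for the (feature, threshold) with the lowest weighted Gini."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best  # None when no split improves the node

def build_tree(rows, labels):
    """Split recursively until splitting no longer improves the classification."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left]),
            build_tree([r for r, _ in right], [y for _, y in right]))

def classify(tree, row):
    """Follow the feature tests from the root down to a leaf label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical protein-pair features; labels are '+' (interacting) or '-'.
rows = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.4], [0.3, 0.9], [0.2, 0.3], [0.1, 0.6]]
labels = ["+", "+", "+", "-", "-", "-"]
tree = build_tree(rows, labels)
```

The learned structure is a nested tuple of feature tests, which can be read directly; this is the interpretability advantage over the MLP noted above.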







Fig. (8). Decision tree or classification tree. In these trees,
internal nodes test features (in each internal node, the
corresponding feature is compared with a value), each branch
corresponds to a feature value and each leaf assigns a class label
(positive for interacting and negative for non-interacting).

Random forest (RF) is a classification method that consists of
many decision trees (Fig. 9). In the training phase, each tree
is constructed from random feature vectors sampled
independently from the dataset; at every node in a tree, a
small fraction of the variables is randomly selected, and each
classification tree is then fully grown. To classify a new
object, the input vector is passed down each of the trees in
the forest, and the class is assigned by majority vote. RF is a
practical classifier when there is a large dataset and a large
number of features, and it requires no feature selection or
feature deletion; it can also rank features according to their
importance for classification. In addition, RF can be used for
recovering missing data, but on some datasets containing noisy
data RF may overfit [114]. Decision trees and random forests
are widely used in bioinformatics and computational biology for
classifying biological data [120], especially for PPI
prediction [73, 121, 122].
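
The combination of random sampling, random feature selection and majority voting can be sketched as below. As a deliberate simplification, each tree is reduced to a single-split stump on one randomly chosen feature, whereas the method described above grows every tree fully; the number of trees, the seed and the toy data are likewise assumptions.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """Fit a one-split tree on a single feature by minimizing training errors."""
    best = None
    for t in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= t]
        right = [y for r, y in zip(rows, labels) if r[feature] > t]
        if not left or not right:
            continue
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, feature, t, left_label, right_label)
    if best is None:  # no usable threshold: fall back to the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return best[1:]

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random feature choice
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest

def forest_predict(forest, row):
    """Pass the input down every tree and return the majority vote."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]

# Hypothetical protein-pair features; labels are '+' (interacting) or '-'.
rf_rows = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.85], [0.75, 0.7],
           [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.25, 0.15]]
rf_labels = ["+", "+", "+", "+", "-", "-", "-", "-"]
forest = train_forest(rf_rows, rf_labels)
```

Individual stumps differ because each sees a different sample and feature, and the vote averages out their individual errors.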
Fig. (9). The random forest (RF) classifier, which consists of many decision trees. To classify a new object, the input vector is passed down each
of the trees in the forest, and one class is assigned to the protein pair by majority vote.



4. RESULTS ASSESSMENT 

The lack of gold standard datasets containing both positive and
negative interactions, which are needed to evaluate the
performance of PPI prediction methods, is a critical problem in
protein-protein interaction prediction. Each available database
of protein-protein interactions contains positive interactions,
which may also include many false positives (interactions that
are not biologically real and arise from the biases and errors
of the tools). The most difficult part is selecting negative
examples (non-interacting proteins): the choice of benchmark
data can affect the performance results and may lead to
overestimating the prediction performance [132, 133]. Recently,
two web-based systems have been developed for constructing
benchmark protein-protein interaction data [134, 135], and some
high-quality PPI datasets have been published for model
organisms [136-138], but studies on constructing gold standard
datasets for PPI prediction have reported conflicting results
[132, 133, 139], and more effort seems to be needed. After a
gold standard dataset has been constructed, the prediction
performance can be assessed with different measures based on
the following four basic parameters:

TP (True Positive or hit): the number of interactions pre- 
dicted correctly. 

TN (True Negative or correct rejection): the number of non- 
interactions predicted correctly. 

FP (False Positive or type I error): the number of non- 
interactions predicted incorrectly as interaction. 

FN (False Negative or type II error): the number of interac- 
tions predicted incorrectly as non-interaction. 
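
To make the four parameters concrete, the sketch below counts them from paired actual and predicted labels and derives the measures of Table 3 from them. The function names, the convention of returning 0.0 for an empty denominator and the example labels are assumptions for illustration.

```python
import math

def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN from paired labels (1 = interaction, 0 = none)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

def ratio(numerator, denominator):
    """Safe division; returning 0.0 for an empty denominator is an assumption."""
    return numerator / denominator if denominator else 0.0

def measures(tp, tn, fp, fn):
    """Evaluation measures computed from the four basic counts."""
    total = tp + tn + fp + fn
    return {
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, total),
        "error_rate": ratio(fp + fn, total),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "mcc": ratio(tp * tn - fp * fn,
                     math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }

# Hypothetical gold standard labels and predictions for eight protein pairs.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(actual, predicted)
results = measures(*counts)
```

For these eight pairs the counts are TP = 2, TN = 4, FP = 1 and FN = 1, so the accuracy is 6/8 and the MCC is 7/15.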



(Table 3) lists the important measures for evaluating
prediction methods. In addition to these measures, one popular
graphical tool for assessing classification performance is the
ROC (receiver operating characteristic) curve, which plots
sensitivity (the true positive rate) against one minus the
specificity (the false positive rate), each of which ranges
between 0 and 1. The ROC curve shows the tradeoff between
sensitivity and specificity (see Fig. 10): the closer the curve
follows the left-hand border and then the top border of the ROC
space, the more accurate the classifier, and the closer the
curve comes to the diagonal of the ROC space, the less accurate
the classifier and the closer it is to a random classifier. The
area under the ROC curve (AUC) is another measure of
classification accuracy; the closer the AUC is to one, the more
accurate the classification. It has been argued that reporting
accuracy and precision can be misleading, whereas AUC has
proved to be a reliable performance measure for imbalanced
problems such as PPI prediction [140, 141]. There are many
tools for evaluating and visualizing the performance of
classifiers [142, 143].
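
A ROC curve and its AUC can be computed directly from classifier scores, as in the following sketch. The threshold sweep and trapezoidal integration are standard; the scores and labels are hypothetical, and the code assumes that both classes are present.

```python
def roc_points(scores, labels):
    """Sweep the decision threshold from high to low, collecting (FPR, TPR) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under (FPR, TPR) points ordered by increasing FPR."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Hypothetical prediction scores for six protein pairs (1 = interaction).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
roc_labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, roc_labels)
area = auc(curve)
```

Here eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9; a classifier that ranks at random would score about 0.5, corresponding to the diagonal of the ROC space.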

5. TOOLS FOR ANALYZING AND VISUALIZING 
PROTEIN-PROTEIN INTERACTION 

After constructing a protein-protein interaction network,
researchers need to visualize and analyze it. In recent years,
many software tools have been developed for this purpose;
(Table 4) briefly describes some of the popular tools used for
the analysis and visualization of biological networks. In the
following section, some of the most popular of these tools are
described in detail.

Cytoscape [144-146] is a free software package and one of the
most popular protein-protein interaction visualization and data
integration tools. It has many interesting features, such as
custom node graphics and attribute equations; with these
features the user can project images onto nodes and use
spreadsheet-like functionality for enhanced network
visualization (supplementary Fig. S1). This software provides
complex network searches, filtering operations and many other
analysis options.

Measure | Description | Formula
Precision | Measures what fraction of the positive interaction predictions is correct. | $\frac{TP}{TP + FP}$
Accuracy | Measures the accuracy of the predictor, assigning the same weight to positive and negative interactions. | $\frac{TP + TN}{TP + FP + TN + FN}$
Error rate | Measures the error rate of the predictor, assigning the same weight to positive and negative interactions. | $\frac{FP + FN}{TP + FP + TN + FN}$
Sensitivity (recall or coverage) or TPR (true positive rate) | Measures what fraction of the real positive interactions was correctly identified by the predictor. | $\frac{TP}{TP + FN}$
Specificity (true negative rate) | Measures what fraction of the real negative interactions was correctly identified by the predictor. | $\frac{TN}{TN + FP}$
Matthews correlation coefficient (MCC) | Measures the correlation between the actual and predicted interactions. | $\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$




Fig. (10). The ROC curve is a plot of the true positive rate
against the false positive rate; it shows the tradeoff between
sensitivity and specificity. The closer the curve follows the
left-hand border and then the top border of the ROC space, the
more accurate the classifier, and the closer the curve comes to
the diagonal of the ROC space, the less accurate the classifier
and the closer it is to a random classifier. If two ROC curves do
not intersect, the upper one dominates the other (in this example
the orange curve is the best and the blue curve is the worst).



Medusa is a powerful standalone Java application for the
visualization of large-scale biological networks in 2D. It also
implements various clustering algorithms: k-means, spectral,
predefined clustering and affinity propagation. It is highly
interactive and offers a variety of layout methods (grid,
random, circular, hierarchical, Fruchterman-Reingold, spring
embedding, distance geometry and parallel coordinates) for more
intuitive visualizations. It also supports various types of
graphs, such as weighted and unweighted, multi-edged, directed
and undirected graphs. Medusa has some other interesting
features: it is compatible with many other tools, supports up
to 10 types of connections, has search functionality, lets the
user collapse and expand nodes, and provides color schemes.
This software allows users to load an arbitrary image as a
background for more descriptive visualizations (supplementary
Fig. S2).

NAViGaTOR is a graphing tool for the 2D and 3D visualization of
biological networks. It is implemented in Java, is freely
available to researchers and can be installed on Windows, Mac,
Linux and Unix. (Supplementary Fig. S3) shows the NAViGaTOR
interface and one rendered network. This software uses hardware
acceleration to facilitate the visualization of large networks.
It supports some popular data interchange formats, such as
PSI-MI, BioPAX and GML, which makes it compatible with other
tools. NAViGaTOR includes many functions for network analysis
and visualization, allows the user to generate high-quality
images, and can be extended through an application programming
interface (API).

6. FUTURE DIRECTION AND CONCERNS: EVOLU- 
TION OF PROTEIN-PROTEIN INTERACTION 
NETWORKS 

Protein-protein interaction networks are highly dynamic [168],
and studying the evolution of protein-protein interaction
networks is one of the central problems of systems biology. The
results of such research are crucial for a better understanding
of the evolution of living systems and could be used for
protein interaction and function prediction.

Table 4. The Most Popular Available Tools for Analyzing and Visualizing Protein-Protein Interaction Networks with Brief Descriptions and References. The OS Column Shows the Operating System(s), Including Linux (Lin), Macintosh (Mac) and Windows (Win), on Which the Tools Can Run. The Popularity of Tools Ranges from One Star to Five Stars Based on the Number of Citations of the Corresponding Publications (According to Google Scholar at the Time of Writing this Manuscript, May 2013): One Star Means Fewer than 50 Citations; Two Stars Mean Between 50 and 100 Citations; Three Stars Mean Between 100 and 200 Citations; Four Stars Mean Between 200 and 500 Citations, and Five Stars Mean More than 500 Citations

Acronym | Description | OS | Availability | Popularity | Stand-alone/Web-based/Plug-in | URL and References
APID (Cancer Research Center) | Agile Protein Interaction DataAnalyzer: an interactive web tool that allows exploration and analysis of protein-protein interactions | Lin/Mac/Win | Free | | Web-based | http://bioinfow.dep.usal.es/apid/index.htm [147]
BiNoM | Developed to facilitate the manipulation of biological networks represented in standard systems biology formats (SBML, SBGN, BioPAX) and to carry out studies on the network structure. | Lin/Mac/Win | Free | ** | Cytoscape plug-in | https://binom.curie.fr/ [148]
BioLayout | A tool for visualization and clustering of biological networks in both 3D and 2D; it is compatible with Cytoscape and also includes analytical approaches to microarray data analysis. | Lin/Mac/Win | Free | | Stand-alone | http://www.biolayout.org [149, 150]
Cerebral | Enhances Cytoscape's functionality by using extra annotation provided by the user both to automatically generate a more pathway-like representation of a network and to provide an environment for the visualization, comparison, and clustering of expression data from multiple conditions. | Lin/Mac/Win | Free | * | Cytoscape plug-in | http://www.pathogenomics.ca/cerebral/ [151]
Cytoscape | A powerful interactive open source network visualization tool | Lin/Mac/Win | Free | | Stand-alone | http://cytoscapeweb.cytoscape.org/ [144, 146]
InterProSurf (University of Texas Medical Branch) | Web server for predicting the functional sites on a protein surface | Lin/Mac/Win | Free | * | Web-based | http://curie.utmb.edu/prosurf.html [152]
Interviewer (Inha University) | Produces a molecular interaction network of good quality without computing the force between every pair of nodes | Win | Free | * | Stand-alone | http://interviewer.inha.ac.kr/ [153]
iSPOT (Università di Roma) | SPOT (Sequence Prediction Of Target) infers the peptide binding specificity of any member of a family of protein binding domains. | Lin/Mac/Win | Free | * | Web-based | http://cbm.bio.uniroma2.it/ispot/ [154]
MCODE | Finds clusters (highly interconnected regions) in a network. It can also be used to lay out any graph that requires stratification according to some characteristic and thus can be used by researchers in a variety of fields | Lin/Mac/Win | Free | | Cytoscape plug-in | http://baderlab.org/Software/MCODE [155]
Medusa | A powerful interactive tool for visualization and clustering analysis of biological networks. | Lin/Mac/Win | Free | * | Stand-alone | https://sites.google.com/site/medusa3visualization [156]
meta-PPISP (Florida State University) | A web server for protein-protein interaction site prediction (built on three individual web servers: cons-PPISP, PINUP, and Promate). | Lin/Mac/Win | Free | ** | Web-based | http://pipe.scs.fsu.edu/meta-ppisp.html [157]
NAViGaTOR (University of Toronto) | Software package for visualizing and analyzing protein-protein interaction networks | Lin/Mac/Win | Free | ** | Stand-alone | http://ophid.utoronto.ca/navigator/ [158]
NOXclass (Max-Planck-Institut für Informatik) | A classifier identifying protein-protein interaction types, implemented using an SVM algorithm. | Lin/Mac/Win | Free | *** | Web-based | http://noxclass.bioinf.mpi-inf.mpg.de/ [159]
Osprey | A tool for visualization and manipulation of complex interaction networks | Lin/Mac/Win | Free (registration is needed) | **** | Stand-alone | http://biodata.mshri.on.ca/osprey/servlet/Index [160]
Pajek | A standalone application that can be used for analyzing large networks with up to millions of nodes and vertices. | Win | Free | | Stand-alone | http://vlado.fmf.uni-lj.si/pub/networks/pajek/ [161, 162]
PathBLAST (Whitehead Institute) | Searches the protein-protein interaction network of the target organism to extract all protein interaction pathways that align with a pathway query. | Lin/Mac/Win | Free | **** | Web-based | http://www.pathblast.org/ [163]
SCOWLP (TU Dresden) | Structural classification of protein binding regions for atomic comparative analysis of protein interactions; allows individual and comparative analysis of protein interactions. | Lin/Mac/Win | Free | * | Web-based | http://www.scowlp.org/scowlp/ [164, 165]
UVCLUSTER | Iterative cluster analysis of protein interaction data. | Lin/Win | Free for academic users (fulfillment of software license agreement is needed) | *** | Stand-alone | http://www.uv.es/genomica/UVCLUSTER/ [166]
VisANT | Designed specifically for the integrative visual data-mining of multi-scale bio-network/pathways. It can also find the over-represented GO terms in network modules. | Lin/Mac/Win | Free | ** | Cytoscape plug-in | http://visant.bu.edu/ [167]

Statistics Versus Comparative Approaches 

Generally, studies on protein-protein interaction network
evolution can be divided into two categories: those based on
statistical and mathematical models, and those based on
comparative network analysis. In approaches based on
statistical and mathematical models, protein-protein
interaction networks are first analyzed (focusing on their
topological features), mathematical and statistical models of
evolving networks are then produced, and their parameters are
tuned to reproduce the properties observed in experimentally
derived networks. In approaches based on comparative network
analysis, the protein-protein interaction networks of species
with different levels of complexity are analyzed and compared
in order to find the evolutionary processes that shaped these
networks [169-171]. In both approaches, three main evolutionary
events are considered to be the processes that have shaped the
structure of the protein-protein interaction network.

Addition of New Nodes 

Gene duplication is an important evolutionary mechanism that
naturally increases the number of proteins in protein-protein
interaction networks. A gene duplication event therefore
corresponds to the addition of a node with links identical to
those of the original node, followed by the divergence of some
of the initially redundant links between the two duplicate
nodes [172]. After gene duplication, a protein product that has
the ability to bind strongly to its partner will be better able
to explore mutations that allow it to co-evolve, or to dimerize
with other, existing homologues using the ancestral binding
mode. This duplicating effect should therefore lead to an
enhanced ability to create homologous interacting pairs of
proteins, and could have played a role in the early emergence
of protein-protein interaction networks. This should allow for
increased resistance to environmental change, or adaptability.
However, single gene duplications may lead to an immediate
stoichiometric imbalance, which would therefore tend to be
counter-selected [173]. In smaller gene families, especially
for genes encoding protein complex components, there is less
potential for paralogs to evolve new interaction partners
through mutation and selection.

Addition and Elimination of Edges 

There are two reasons for the loss and gain of interactions,
namely neofunctionalization and subfunctionalization. After
gene duplication, the second copy of the gene is relatively
free from selective pressure and is able to diverge and
accumulate mutations faster than a functional single-copy gene
because these mutations often have no deleterious effects [174,
175]. If these mutations are solely degenerative in nature,
this will lead to a nonfunctional gene product, but if instead
they are innovative, this can lead to neofunctionalization and
the acquisition of novel features. In another possible
trajectory, some of the functions of the original gene are
assigned to the new copy and both copies accumulate
degenerative mutations, leading to a differentiation of
function and division of labor (i.e. subfunctionalization). The
result of these processes is the divergence of the interaction
patterns of the original gene and its copy (i.e. the addition
and elimination of edges).

Elimination of Nodes 

After gene duplication, mutations in the new copy of a gene can
convert it into a nonfunctional gene; it is consequently
deleted from the network because it no longer has any
interactions (gene loss).
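
The three events can be combined in a toy duplication-divergence simulation of the kind used in the statistical and mathematical modeling approach. This is a minimal sketch under stated assumptions: the retention probability, the two-protein seed network and the function name are hypothetical, and the gain of new edges through neofunctionalization is not modeled.

```python
import random

def duplication_divergence(steps, retain_prob=0.4, seed=0):
    """Grow a toy PPI network; returns {node: set of interaction partners}."""
    rng = random.Random(seed)
    network = {0: {1}, 1: {0}}  # seed network: one interacting pair
    next_id = 2
    for _ in range(steps):
        original = rng.choice(sorted(network))
        copy = next_id
        next_id += 1
        # Addition of a node: the duplicate inherits every link of the original.
        network[copy] = set(network[original])
        for partner in network[copy]:
            network[partner].add(copy)
        # Elimination of edges: each inherited link of the copy is kept
        # with probability retain_prob and lost otherwise.
        for partner in sorted(network[copy]):
            if rng.random() > retain_prob:
                network[copy].discard(partner)
                network[partner].discard(copy)
        # Elimination of nodes: a copy left without interactions is removed.
        if not network[copy]:
            del network[copy]
    return network

toy_network = duplication_divergence(50)
```

Tuning `retain_prob` and comparing the topology of the resulting network (for example its degree distribution) with an experimentally derived network corresponds to the parameter-tuning step described for the model-based approaches.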
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